External validation of 87 clinical prediction models supporting clinical decisions for breast cancer patients

Introduction: Numerous prediction models have been developed to support treatment-related decisions for breast cancer patients. External validation, a prerequisite for implementation in clinical practice, has been performed for only a few models. This study aims to externally validate published clinical prediction models using population-based Dutch data. Methods: Patient-, tumor- and treatment-related data were derived from the Netherlands Cancer Registry (NCR). Model performance was assessed using the area under the receiver operating characteristic curve (AUC), scaled Brier score, and model calibration. Net benefit across applicable risk thresholds was evaluated with decision curve analysis. Results: After assessing 922 models, 87 (9%) were included for validation. Models were excluded due to an incomplete model description (n = 262 (28%)), lack of required data (n = 521 (57%)), previous validation or development with NCR data (n = 45 (5%)), or an insufficient associated NCR sample size (n = 7 (1%)). The included models predicted survival (33 (38%) overall, 27 (31%) breast cancer-specific, and 3 (3%) other cause-specific), locoregional recurrence (n = 7 (8%)), disease-free survival (n = 7 (8%)), metastases (n = 5 (6%)), lymph node involvement (n = 3 (3%)), pathologic complete response (n = 1 (1%)), and surgical margins (n = 1 (1%)). Seven models (8%) showed poor (AUC<0.6), 39 (45%) moderate (AUC:0.6–0.7), 38 (44%) good (AUC:0.7–0.9), and 3 (3%) excellent (AUC≥0.9) discrimination. Using the scaled Brier score, worse performance than an uninformative model was found in 34 (39%) models. Conclusion: Comprehensive registry data supports broad validation of published prediction models. Model performance varies considerably in new patient populations, affirming the importance of external validation studies before applying models in clinical practice. Well-performing models could be clinically useful in a Dutch setting after careful impact evaluation.


Introduction
Worldwide, over 2.2 million new cases of breast cancer were diagnosed in 2020 [1]. In the Netherlands, over 17,000 women and 100 men are diagnosed with breast cancer annually, making this the most commonly diagnosed cancer in women [2]. Even though breast cancer survival has improved throughout the past decades, the prognosis of an individual breast cancer patient strongly depends on patient- and tumor-related characteristics, and available treatment options [3].
To support (shared) decision-making by patients and clinicians regarding breast cancer treatment, prediction models have been developed that estimate the probability of certain outcomes using available patient- and tumor-related characteristics. An example of such a model is PREDICT [4], which is frequently used to support clinical decision-making on adjuvant systemic therapy.
Previously, a systematic literature review was performed to identify available prediction models that may provide valuable information to support treatment decision-making [5]. A total of 922 prediction models were identified, which were developed to predict clinical outcomes such as treatment response, lymph node involvement, adverse events, recurrence, and (breast cancer-specific) survival. However, the majority of the identified models were found to be at high risk of bias according to the Prediction model Risk Of Bias ASsessment Tool (PROBAST) [6]. The clinical utility of most of these models remained unclear, as a substantial number of models were not reported according to established reporting guidelines or showed methodological flaws during the development and/or internal validation of the model.
Before prognostic models are used in a clinical setting, they should be validated both internally and externally on the target population [7]. Moreover, the impact of the models on clinical practice should subsequently be assessed [8]. Still, new models are developed more often than existing models are externally validated, and impact studies are performed even less often, which means that potentially valuable information on the performance of a model is lacking [9]. This prevents existing models from being implemented in daily practice to support clinical decision-making in a certain population. However, when already available prediction models perform well on external data sets, the creation of new models becomes less relevant than implementing valuable and validated models and keeping these up to date [10]. Therefore, this study aims to evaluate the performance of previously identified prediction models using readily available data obtained from the Netherlands Cancer Registry (NCR).

Study population
The performance of identified clinical prediction models was evaluated using data obtained from the NCR. The NCR is a nationwide database comprising all newly diagnosed malignant tumors in the Netherlands. The data cohort consisted of patients diagnosed with breast cancer between 2003 and 2019. Invasive and non-invasive cancers were included, as well as female and male breast cancer patients. Patients were excluded if they were younger than 18 years old, or when the cancer was diagnosed during an autopsy.
Based on the patient group targeted by a prediction model, specific subgroups of patients were extracted from the full dataset to perform the model validation. To validate the different models, the definition of included variables, and the inclusion and exclusion criteria were applied as described in the original paper as much as possible.

Model selection
The previously identified 922 clinical prediction models, described in 534 papers, were considered potential candidates for external validation and were selected based on four criteria.
First, models were selected if sufficient details were reported to recover the underlying equation, allowing the calculation of risks of the outcome for individual patients. For this, the variable coefficients required to calculate the result of a model had to be available (or recoverable from a nomogram), and all required variables (input variables and outcome) had to be clearly defined.
Second, the required data, including both the input and outcome variables, for adequate validation of the model had to be available in the NCR.
Third, models were excluded when they were either developed by or previously validated on NCR data.
Fourth, models were excluded if the available sample size within the NCR to validate the model was too small. For sample size considerations, the rule of thumb of at least 100 events and 100 non-events reported by Vergouwe et al. was initially used [11]. When the sample contained fewer than 100 events or fewer than 100 non-events (the rule implies a minimal requirement of 200 patients when the outcome occurs in 50% of the patients), additional calculations were performed according to the study by Riley et al. to determine if the available data allowed validation [12].
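The rule of thumb described above can be illustrated with a minimal sketch. It covers only the initial 100 events and non-events check, not the additional calculations by Riley et al. [12]. The function names and the `minimum` parameter are hypothetical and chosen for illustration only.

```python
import math


def meets_rule_of_thumb(n_events: int, n_non_events: int, minimum: int = 100) -> bool:
    """Return True when both events and non-events reach the minimum count."""
    return n_events >= minimum and n_non_events >= minimum


def minimum_cohort_size(event_rate: float, minimum: int = 100) -> int:
    """Smallest cohort expected to contain `minimum` events and `minimum` non-events.

    The rarer of the two classes determines the required cohort size.
    """
    if not 0 < event_rate < 1:
        raise ValueError("event_rate must lie strictly between 0 and 1")
    rarer_fraction = min(event_rate, 1 - event_rate)
    return math.ceil(minimum / rarer_fraction)
```

With an event rate of 50%, `minimum_cohort_size(0.5)` gives 200 patients, in line with the example in the text. A less balanced outcome raises the requirement, for instance to 400 patients at an event rate of 25%.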
Several assumptions were made in the data to allow more models to be validated. As the cause of death is not recorded in the NCR, patients who died with known metastatic breast cancer were assumed to have died due to breast cancer. The breast cancer subtype definition varies between models. When no clear definition was provided in the paper describing the development of the model, the following definition was applied for breast cancer subtype: Luminal A (HR+ & HER2-), Luminal B (HR+ & HER2+), HER2-enriched (HR- & HER2+), and triple negative (HR- & HER2-). For models predicting a time-to-event outcome that may occur more than once (e.g. metastasis or locoregional recurrence), only the first event that occurred was taken into account.
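As an illustration, the default subtype definition could be encoded as a small lookup on hormone receptor (HR) and HER2 status. This is a sketch, not the code used in the study. It assumes that Luminal B denotes HR+ and HER2+ tumors, and the function and argument names are hypothetical.

```python
def breast_cancer_subtype(hr_positive: bool, her2_positive: bool) -> str:
    """Map HR and HER2 status to the default four-category subtype."""
    if hr_positive:
        # Assumption: Luminal B is the HR+ group with HER2 overexpression.
        return "Luminal B" if her2_positive else "Luminal A"
    return "HER2-enriched" if her2_positive else "Triple negative"
```

Models that report their own subtype definition would need a separate mapping that follows the original paper.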

Statistical analysis
All models were assessed on their performance in terms of discrimination, calibration, and net benefit. Discrimination concerns the ability of a model to distinguish between patients at high and low risk of the predicted outcome, and was quantified with the area under the receiver operating characteristic curve (AUC) and visualized using classification plots as proposed by Verbakel et al. [13]. Discriminatory performance was considered poor (AUC<0.6), moderate (AUC:0.6-0.7), good (AUC:0.7-0.9), or excellent (AUC≥0.9).
Calibration concerns the level of agreement between predicted and observed event rates and was visualized using calibration plots. In addition, the Brier score and the scaled Brier score were estimated for each model. The Brier score is the mean squared difference between predicted and observed outcomes [14]. Brier scores range between 0 and 1, and a lower Brier score indicates better performance. The scaled Brier score compares the Brier score to the Brier score of an uninformative model (i.e. assuming the observed event rate is the predicted risk for all patients). A higher scaled Brier score indicates better performance, and a scaled Brier score <0 indicates that the model performs worse than an uninformative model.
A combination of the AUC and the scaled Brier score was used to categorize the overall performance of the models into poor (AUC<0.7 and scaled Brier≤0), moderate (either an AUC≥0.7 or a scaled Brier>0, but not both), and good (AUC≥0.7 and scaled Brier>0). Clinical usefulness was assessed by comparing the net benefit of applying the model over all feasible thresholds, and was visualized using decision curve analysis, in which the added value of the model is compared to the default strategies of treating all or no patients [15].
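The measures described above can be sketched for a binary outcome as follows. This is an illustrative sketch under simplifying assumptions, not the analysis code of the study: it ignores censoring in time-to-event outcomes, all function names are hypothetical, and the net benefit uses the standard decision curve definition (true positives minus false positives weighted by the odds of the threshold, per patient).

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Share of event/non-event pairs in which the event has the higher risk (ties count 0.5)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in non_events)
    return wins / (len(events) * len(non_events))


def brier(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """1 - Brier / Brier of an uninformative model predicting the event rate for everyone."""
    event_rate = sum(y_true) / len(y_true)
    reference = brier(y_true, [event_rate] * len(y_true))
    return 1 - brier(y_true, y_prob) / reference


def overall_category(auc_value: float, scaled_brier_value: float) -> str:
    """Combine AUC and scaled Brier score into poor, moderate, or good."""
    good_auc = auc_value >= 0.7
    good_brier = scaled_brier_value > 0
    if good_auc and good_brier:
        return "good"
    if good_auc or good_brier:
        return "moderate"
    return "poor"


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit of treating patients whose predicted risk is at or above the threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)
```

For the made-up outcomes `[0, 0, 1, 0, 1, 1, 0, 1]` with predicted risks `[0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]`, the sketch gives an AUC of 0.9375 and a scaled Brier score of about 0.509, so the overall category is "good". At a threshold of 0.5 the net benefit is 0.375, compared with 0 for treating all patients (obtained by passing a predicted risk of 1.0 for everyone) and 0 for treating none.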
A separate dataset was created based on the original inclusion and exclusion criteria reported for each of the validated models. Missing data were assessed for each separate dataset and, where appropriate, handled using multiple imputation by chained equations (MICE) [16]. Missing data were imputed on the complete dataset to ensure accurate estimations. The process of data imputation and model performance evaluation was repeated using 200 bootstrap samples.
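The repetition of imputation and performance evaluation over bootstrap samples can be sketched as below. This is a simplified illustration and not the procedure used in the study: a single mean imputation stands in for MICE, the order of resampling and imputing is an assumption, and the names `mean_impute` and `bootstrap_metric` are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple[Optional[float], int]   # (predictor value or None if missing, outcome)
CompleteRow = Tuple[float, int]


def mean_impute(rows: Sequence[Row]) -> List[CompleteRow]:
    """Stand-in for MICE: replace missing predictor values by the observed mean."""
    observed = [x for x, _ in rows if x is not None]
    fill = mean(observed) if observed else 0.0  # fallback if everything is missing
    return [(fill if x is None else x, y) for x, y in rows]


def bootstrap_metric(
    rows: Sequence[Row],
    impute: Callable[[Sequence[Row]], List[CompleteRow]],
    metric: Callable[[List[CompleteRow]], float],
    n_boot: int = 200,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Resample, impute, and score n_boot times; return mean and 95% percentile interval."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # draw rows with replacement
        estimates.append(metric(impute(sample)))
    estimates.sort()
    lower = estimates[n_boot * 25 // 1000]
    upper = estimates[n_boot * 975 // 1000 - 1]
    return mean(estimates), lower, upper
```

Any performance measure can be passed as `metric`, for example a Brier score computed on the imputed rows. Fixing `seed` makes the resampling reproducible.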

Patient data
Data on 288,784 tumors diagnosed in 271,040 patients were obtained from the NCR. Patient characteristics are displayed in Table 1.

Model selection
All 922 models were initially considered for inclusion in our study. A total of 262 (28%) models were not described in sufficient detail to calculate a risk for new patients (e.g. the original model equation could not be derived due to lack of reported model coefficients) and could not be validated. Another 521 (57%) models were excluded due to the unavailability of required input or outcome data in the NCR. The variables whose unavailability most commonly led to the exclusion of a model were race (n = 89), genetic data (n = 77), lymphovascular invasion (LVI) (n = 56), marital status (n = 54), Ki67 (n = 39), and lymphocytes (including tumor infiltrating lymphocytes and indices such as monocyte-to-lymphocyte ratio) (n = 31). Models developed or previously validated with NCR data (n = 45 (5%)) were also excluded, and lastly, 7 (1%) models were excluded as the available sample size was too small to validate these models. Finally, a total of 38 papers reporting on a total of 87 (9%) models were included in our external validation study. The process of including and excluding the models is visualized in the flowchart in Fig. 1.
An overview of the included models is provided in Table 2. A total of 33 (38%) models were developed to predict overall survival (OS), 27 (31%) models predicted breast cancer-specific survival (BCSS), 3 (3%) models other cause-specific survival (OCSS), 7 (8%) models disease-free survival (DFS), 7 (8%) models locoregional recurrence (LRR), 5 (6%) models metastasis, 3 (3%) models lymph node involvement (LNI), 1 (1%) model pathologic complete response (PCR), and 1 (1%) model surgical margin status. Several models were developed for a specific subset of patients. For instance, the models developed by Chen et al. (models 19a & 19b) were specifically aimed at providing BCSS predictions for male breast cancer patients. A short description of the specific patient subgroups per model is displayed in Table 2, and more detailed descriptions can be found in the supplementary tables.

Model performance evaluation
The performance of 87 models was evaluated. For each model, the AUC and (scaled) Brier score were calculated, and a calibration plot, classification plot, and decision curve were produced (Supplementary data).

Discussion
In this study, a total of 87 prediction models were externally validated using data from the nationwide NCR, and 34 (39%) models showed a good discriminative performance and calibration. On AUC alone, 41 (47%) models showed good performance (AUC ≥0.7), and on the scaled Brier score, 53 (61%) models showed a better performance than an uninformative model. The net benefit of the validated models was assessed using decision curve analysis. It is difficult to provide summary measures of the net benefit for the validated models, as the relevant threshold probabilities are necessary to interpret the curve and the thresholds differ between models. Additionally, the threshold probabilities should not be selected solely on the basis of the results displayed in a decision curve, but should be based on a clinically reasonable range, combined with the decision curve results [17]. Assessing these ranges was not the aim of the current study, but the provided decision curves can be used as input for future studies elaborating on the clinical usefulness and impact of implementing one or more of the included models in clinical practice.
To validate the included models, several assumptions had to be made due to the lack of a complete and transparent description of the model in the underlying paper. For instance, the models 18a & 18b developed by Wen et al. predict 5- and 10-year BCSS, respectively, using the log odds of positive lymph nodes as a predictor [18]. The paper provided a definition of this predictor, but did not provide a base value for the logarithmic transformation. Also, Wen et al. [18] presented their model in a nomogram in which the log odds has to be entered as a value between 1 and 4, but no transformation of the predictor was provided. The poor performance of the model may be caused by this lack of transparency, and as a result a potentially useful model cannot yet be advised for use in clinical practice. Similar difficulties were identified for the validation of the models 7a-7c provided by Zhao et al. [19], where the definitions of both the predictors and the outcome were ambiguous. For instance, OS and BCSS were used interchangeably as the outcome, and no proper definitions were provided for variables for which different definitions exist, including oligo-metastasis, breast cancer subtype, and advanced breast cancer. As the cause of death is not available in the NCR, disease-specific mortality was assumed to occur when the patient died while being diagnosed with metastatic disease. The adequate performance found in multiple models predicting BCSS suggests that this assumption was appropriate. Several papers described multiple models that predicted OS and BCSS for metastatic breast cancer patients, such as the models 10a-10d and 11a-11l. Due to our definition of BCSS, the dataset used to validate these models was exactly the same (including the OS and BCSS outcomes). Still, the differences found in model performance were small and insignificant, so we do not expect that this assumption has negatively impacted our results.
The design of the validated models affected the performance measures. For instance, model 23 incorporated LVI as a predictor, where missingness of the predictor was dealt with by modelling "unknown" as a possible input option. However, the coefficient for "unknown" was lower than those of the other possible input options for the predictor (i.e. LVI or no LVI). Because LVI was missing entirely in the NCR, predicted probabilities were lower for all patients compared to a situation in which the predictor values would not be missing. The predictor also had no discriminative value this way, as it took the same value in all patients. Another remarkable finding concerns the models 9d-9f predicting BCSS at 3, 4, and 5 years, respectively, where the predicted probability can be higher at 5 years than at 3 or 4 years. Such results are difficult to explain and interpret when applying these models in patient care, regardless of their performance.
The inclusion and exclusion criteria of the original models were applied as much as possible, but some discrepancies were found between the criteria described in the papers reporting the development of the models and the group of patients to which the models could be applied. For instance, the models 20a and 20b described by Fu et al. [20] include the location of the tumor in the breast as a predictor (e.g. axillary tail, central, lower inner, lower outer, upper inner, or upper outer), but the data in the NCR also include patients with a tumor in an overlapping region. As it was unclear how Fu et al. dealt with these patients, they were excluded from the subgroup used for validation of these models [20].
Although models often predicted the same outcome, they could barely be compared to each other as their target patient populations varied. For instance, models 19a and 19b were intended for male breast cancer patients, while others were developed for more general populations. This discrepancy in patient selection criteria may partly account for the variations in model performance. However, poor model performance can also be due to the methodology used to develop and (internally) validate the models. As we previously reported in our systematic review, many prediction models for breast cancer were considered to be at high risk of bias, and Venema et al. [21] demonstrated that such models perform worse on external validation than models with a low risk of bias.
A strength of the current study concerns the large data set used to validate the models. In addition, due to the inclusion of as many identified prognostic models as possible, a total of 87 models could be validated. Given that a total of 922 models were initially considered for external validation, the number of 87 models seems low. The majority of the models could not be validated with NCR data due to the unavailability of several required variables such as race, genetic data, LVI, marital status, Ki67, and lymphocytes (including tumor infiltrating lymphocytes and indices such as monocyte-to-lymphocyte ratio). As these data were incorporated in many different models, it is reasonable to assume that they provide relevant prognostic information and may become valuable additions for future data collection in the NCR or other registries. On the other hand, successful adoption of clinical prediction models relies on both performance and applicability. A model that performs very well but requires input data that are not routinely collected may be less likely to be widely adopted in clinical practice. The NCR provided a large database with many relevant data items, but some of the commonly required variables were missing for various reasons. For instance, due to a lack of consistency in definitions of cutoffs and methods to estimate Ki67 [22], the variable is not routinely collected. In addition, the inclusion of predictors such as marital status and race can be considered controversial, and may lead to undesirable effects in addressing disparities [23]. Alternative modelling methods may be applied to improve the applicability of prediction models without losing too much of their predictive performance, e.g. by creating submodels that enable users to still use the model when one or more of the predictors are not available, although estimates will become somewhat less accurate (reflected in larger confidence intervals) [24].
Multiple models showed a good performance in Dutch breast cancer patients. However, before these models can be used in clinical practice, additional analyses are advised. A potentially useful next step concerns the update and re-calibration of likely valuable models. Subsequent impact studies could further define the value of incorporating some of the validated models in clinical practice. Cost-effectiveness analyses are often omitted, but are well suited to estimating the actual benefits to patients and to the healthcare system when models are used in practice [9]. As highlighted by Vickers et al., a model with good performance is not necessarily a valuable model [17]. In order to assess the value of models, a description of the intended use of the model is required, which should clearly indicate which decision can be supported with the model. For example, a model with a moderate performance may prove valuable if there are no alternatives available, but if there are multiple models with the same intended use, the best performing model on validation should be considered for implementation. Additionally, in the European Union, the use of web-apps to calculate patient-tailored predictions to inform clinical management requires the certification of the software incorporating the model under the Medical Device Regulation [25]. To improve the efficiency and impact of prediction model development, developers should take into account the different steps needed to get valuable decision support into clinical practice even before models are developed.

Conclusion
The external validity of 87 prediction models to support treatment decisions of breast cancer patients was assessed. On a large Dutch registry dataset, 34 (39%) models showed a good performance, 26 (30%) models showed a moderate performance, and 27 (31%) models showed a poor performance, according to our predefined definitions. Of the models showing good performance, 14 (41%) predicted BCSS, 13 (38%) predicted OS, 3 (9%) predicted OCSS, 2 (6%) predicted metastasis, 1 (3%) predicted DFS, and 1 (3%) predicted LRR. These results allow the next step towards clinical use. After careful evaluation to assess the impact of incorporating the models with a clear intended use in a usable tool, clinical adoption in the Dutch health care setting can be justified.

Funding source
The study was performed without study sponsors.

Ethical approval
Ethical approval was not required for this study.

Data availability
This study used the data from the Netherlands Cancer Registry. Data are available upon request at the Netherlands Comprehensive Cancer Organisation (IKNL) via https://iknl.nl/en/ncr/apply-for-data.